Abstract. Mallows and Riordan showed in 1968 that labeled trees with a small number of
inversions are related to labeled graphs that are connected and sparse. Wright enumerated
sparse connected graphs in 1977, and Kreweras related the inversions of trees to the
so-called "parking problem" in 1980. A combination of these three results leads to a
surprisingly simple analysis of the behavior of hashing by linear probing, including higher
moments of the cost of successful search.


The well-known algorithm of linear probing for $n$ items in $m > n$ cells can be described as
follows: Begin with all cells $(0, 1, \ldots, m-1)$ empty; then for $1 \le k \le n$, insert the $k$th item into the
first empty cell in the sequence $h_k$, $(h_k+1) \bmod m$, $(h_k+2) \bmod m$, $\ldots$, where $h_k$ is a random
integer in the range $0 \le h_k < m$. (See, for example, [4, Algorithm 6.4L].)

The purpose of this note is to exhibit a surprisingly simple solution to a problem that appears 
in a recent book by Sedgewick and Flajolet [9]: 

Exercise 8.39 Use the symbolic method to derive the EGF of the number of probes 
required by linear probing in a successful search, for fixed M. 

The authors admitted that they did not know how to solve the problem, in spite of the fact that a
"symbolic method" was the key to the analysis of all the other algorithms in their book. Indeed,
the second moment of the distribution of successful search by linear probing was unknown when
[9] was published in 1996.

If the $k$th item is inserted into position $q_k$, the quantity $d = \sum_{k=1}^{n} \bigl((q_k - h_k) \bmod m\bigr)$ is the total
displacement of the items from their hash addresses. The average number of probes needed in a
successful search is then $1 + d/n$. Our goal in the following is to study the probability distribution
of $d$ as a function of the table size $m$ and the number of items $n$.
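
The insertion rule and the definition of $d$ can be illustrated by a minimal Python sketch. The function name `linear_probe_insert` and the sample hash sequence are illustrative choices, not part of the original text.

```python
def linear_probe_insert(hashes, m):
    """Insert items with hash addresses `hashes` into m cells by linear probing.

    Returns (positions, d): the cell q_k where each item ends up, and the
    total displacement d = sum((q_k - h_k) mod m).
    """
    if len(hashes) >= m:
        raise ValueError("need n < m so that at least one cell stays empty")
    occupied = [False] * m
    positions = []
    d = 0
    for h in hashes:
        q = h
        while occupied[q]:          # probe h, h+1, h+2, ... (mod m)
            q = (q + 1) % m
        occupied[q] = True
        positions.append(q)
        d += (q - h) % m
    return positions, d


# Hypothetical example: n = 5 items, m = 6 cells.
positions, d = linear_probe_insert([3, 3, 4, 0, 3], m=6)
print(positions, d, 1 + d / 5)      # final cells, total displacement, average probes
```

The last printed value is the quantity $1 + d/n$ for this one hash sequence; the paper studies the distribution of $d$ when the $h_k$ are random.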


1. Generating functions. Let $D_{mn}(x) = \sum x^d$, summed over all $m^n$ possible hash sequences
$h_1 \ldots h_n$, and let $F_{mn}(x)$ be the same sum restricted to hash sequences that are confined, in the
sense that linear probing with $h_1 \ldots h_n$ will leave cell 0 unoccupied.

Given $h_1 \ldots h_n$, the $m$ hash sequences $((h_1+j) \bmod m \,\ldots\, (h_n+j) \bmod m)$ for $0 \le j < m$ all
lead to the same total displacement $d$. And exactly $m-n$ of them, a fraction $(m-n)/m$, will be
confined in the sense above, one for each cell that stays empty. Therefore
$D_{mn}(x) = \frac{m}{m-n} F_{mn}(x)$, and the probability generating function for $d$ is

$$\frac{D_{mn}(x)}{m^n} = \frac{F_{mn}(x)}{m^{n-1}(m-n)}. \tag{1.1}$$

The quantity $F_{mn}(x)$ is easier to deal with than $D_{mn}(x)$, since linear probing does not "wrap
around" when the hash sequence is confined. We obviously have $0 \le h_k \le q_k < m$ in a confined
sequence; therefore remainders mod $m$ are not actually taken and the behavior is simpler.
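
The rotation argument can be checked by brute force for small tables. The following is a sketch under the definitions above; the helper names `total_displacement` and `displacement_polys` are assumptions made for illustration, and polynomials are stored as maps from $d$ to the coefficient of $x^d$.

```python
from collections import Counter
from itertools import product


def total_displacement(hashes, m):
    """Return (d, occupied) after inserting `hashes` into m cells by linear probing."""
    occupied = [False] * m
    d = 0
    for h in hashes:
        q = h
        while occupied[q]:
            q = (q + 1) % m
        occupied[q] = True
        d += (q - h) % m
    return d, occupied


def displacement_polys(m, n):
    """Brute-force D_mn and F_mn as Counters mapping d -> coefficient of x^d."""
    D, F = Counter(), Counter()
    for hashes in product(range(m), repeat=n):
        d, occupied = total_displacement(hashes, m)
        D[d] += 1
        if not occupied[0]:         # confined: cell 0 stays empty
            F[d] += 1
    return D, F


for m, n in [(3, 2), (4, 2), (5, 3), (6, 4)]:
    D, F = displacement_polys(m, n)
    # (m - n) * D_mn(x) == m * F_mn(x), coefficient by coefficient
    assert all((m - n) * D[d] == m * F[d] for d in D)
    assert sum(D.values()) == m ** n
```

The enumeration grows like $m^n$, so this is only a sanity check for tiny parameters, not a way to compute the distribution.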
The special case of confined linear probing in which $m = n+1$ has been called the parking
problem [5], because we can think of $n$ cars that try to park in $n$ consecutive spaces, where the
$k$th car starts its search in position $h_k$. The number of sequences $h_1 \ldots h_n$ such that all cars are
successfully parked is the number of confined hash sequences, namely $\frac{m-n}{m}\, m^n = (n+1)^{n-1}$, when
$m = n+1$. We will write

$$F_n(x) = F_{n+1,n}(x) \tag{1.2}$$

for the generating function of total displacement in the parking problem.
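
For small $n$ the parking polynomial $F_n(x)$ can be tabulated directly from the definition. The sketch below simulates probing in $m = n+1$ cells and keeps only the sequences that leave cell 0 empty; the name `parking_polynomial` is hypothetical.

```python
from collections import Counter
from itertools import product


def parking_polynomial(n):
    """Brute-force F_n as a Counter {d: number of parking sequences with displacement d}.

    Every h_k ranges over {0, ..., n}; a sequence is counted only if probing
    in m = n + 1 cells leaves cell 0 empty (the "confined" condition).
    """
    m = n + 1
    F = Counter()
    for hashes in product(range(m), repeat=n):
        occupied = [False] * m
        d = 0
        for h in hashes:
            q = h
            while occupied[q]:
                q = (q + 1) % m
            occupied[q] = True
            d += (q - h) % m
        if not occupied[0]:
            F[d] += 1
    return F


for n in range(1, 6):
    # the number of parking sequences should be (n + 1)^(n - 1)
    assert sum(parking_polynomial(n).values()) == (n + 1) ** (n - 1)
```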
The general case is clearly related to the special case $m = n+1$ by

$$F_{n+r,n}(x) = \sum_{n_1+n_2+\cdots+n_r=n} \frac{n!}{n_1!\, n_2! \cdots n_r!}\, F_{n_1}(x)\, F_{n_2}(x) \cdots F_{n_r}(x), \tag{1.3}$$

because every confined hash sequence leaves $r$ cells

$$\{0,\; n_1+1,\; n_1+n_2+2,\; \ldots,\; n_1+\cdots+n_{r-1}+r-1\}$$

empty, and defines parking sequences on blocks of sizes $n_1+1, n_2+1, \ldots, n_r+1$ for some nonnegative
integers $n_1, n_2, \ldots, n_r$. The number of ways to fit such subsequences into $h_1 \ldots h_n$ is the multinomial
coefficient $n!/(n_1!\, n_2! \cdots n_r!)$.

Let

$$F(x,z) = \sum_{n \ge 0} F_n(x)\, \frac{z^n}{n!} \tag{1.4}$$

generate the displacements of successfully parked cars. Equation (1.3) tells us that

$$\frac{F_{mn}(x)}{n!} = [z^n]\, F(x,z)^{m-n}; \tag{1.5}$$

hence the bivariate generating function $F(x,z)$ is the key to the distribution of total displacement.
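
Relation (1.5) can be tested numerically at a fixed value of $x$. In this sketch both sides are computed from the definitions: $F_{mn}(x)$ by enumeration, and the right-hand side by multiplying truncated power series with exact rational coefficients. The function names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product
from math import factorial


def F_mn_at(m, n, x):
    """Brute-force F_mn(x): the sum of x^d over confined hash sequences."""
    total = Fraction(0)
    for hashes in product(range(m), repeat=n):
        occupied, d = [False] * m, 0
        for h in hashes:
            q = h
            while occupied[q]:
                q = (q + 1) % m
            occupied[q] = True
            d += (q - h) % m
        if not occupied[0]:
            total += x ** d
    return total


def rhs_1_5(m, n, x):
    """n! [z^n] F(x, z)^(m - n), with F_k(x) = F_{k+1,k}(x) as in (1.2)."""
    base = [F_mn_at(k + 1, k, x) / factorial(k) for k in range(n + 1)]
    power = [Fraction(1)] + [Fraction(0)] * n      # the series 1
    for _ in range(m - n):                         # multiply by F(x, z), m - n times
        power = [sum(power[i] * base[k - i] for i in range(k + 1))
                 for k in range(n + 1)]
    return power[n] * factorial(n)


for m, n in [(3, 2), (5, 2), (5, 3), (6, 3)]:
    for x in (Fraction(1), Fraction(2), Fraction(-1)):
        assert F_mn_at(m, n, x) == rhs_1_5(m, n, x)
```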

2. Solution to the parking problem. Suppose $h_1 \ldots h_n$ is a confined hash sequence for the
special case $m = n+1$, with $n \ge 1$. This holds if and only if $h_n \ge 1$ and $h_1 \ldots h_{n-1}$ leaves cells 0
and $k$ empty for some $k$ in the range $h_n \le k \le n$. The sequence $h_1 \ldots h_{n-1}$ then decomposes into
parking subsequences for $k-1$ and $n-k$ cars.

Therefore, by arguing as in (1.3) above, we see that the polynomials $F_n(x)$ satisfy the recurrence

$$F_n(x) = \sum_{k=1}^{n} \binom{n-1}{k-1} (1 + x + \cdots + x^{k-1})\, F_{k-1}(x)\, F_{n-k}(x). \tag{2.1}$$

(The factor $1 + x + \cdots + x^{k-1}$ corresponds to the displacement of the $n$th car, while $\binom{n-1}{k-1}$ is the
number of ways to mix the two subsequences.) The first few values are

$$\begin{aligned}
F_0(x) &= 1;\\
F_1(x) &= 1;\\
F_2(x) &= 2 + x;\\
F_3(x) &= 6 + 6x + 3x^2 + x^3.
\end{aligned}\tag{2.2}$$
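
Recurrence (2.1) translates directly into code. The following sketch stores each $F_n$ as a list of integer coefficients (index = power of $x$); the name `parking_polys` is an illustrative choice.

```python
from math import comb


def parking_polys(N):
    """Return [F_0, ..., F_N], each as a coefficient list in x, via recurrence (2.1)."""
    F = [[1]]                                   # F_0(x) = 1
    for n in range(1, N + 1):
        coeffs = [0] * (n * (n - 1) // 2 + 1)   # largest displacement is 0 + 1 + ... + (n-1)
        for k in range(1, n + 1):
            a, b = F[k - 1], F[n - k]
            prod = [0] * (len(a) + len(b) - 1)  # F_{k-1}(x) * F_{n-k}(x)
            for i, ai in enumerate(a):
                for j, bj in enumerate(b):
                    prod[i + j] += ai * bj
            c = comb(n - 1, k - 1)
            for shift in range(k):              # multiply by 1 + x + ... + x^(k-1)
                for i, p in enumerate(prod):
                    coeffs[i + shift] += c * p
        F.append(coeffs)
    return F


F = parking_polys(6)
print(F[2], F[3])                               # coefficient lists of F_2 and F_3
assert all(sum(F[n]) == (n + 1) ** (n - 1) for n in range(1, 7))
```

The final assertion checks $F_n(1) = (n+1)^{n-1}$, the number of parking sequences.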
Recurrence (2.1) can be put into a more user-friendly form if we write

$$A_n(x) = (x-1)^n F_n(x). \tag{2.3}$$

Then $A_0(x) = 1$, and for $n > 0$ we have

$$A_n(x) = \sum_{k=1}^{n} \binom{n-1}{k-1} (x^k - 1)\, A_{k-1}(x)\, A_{n-k}(x). \tag{2.4}$$

For fixed $x$, this recurrence can be analyzed by using the exponential generating functions

$$A(z) = \sum_{n=0}^{\infty} A_n(x)\, \frac{z^n}{n!}, \tag{2.5}$$

$$B(z) = \sum_{n=1}^{\infty} B_n(x)\, \frac{z^n}{n!}, \tag{2.6}$$

where

$$B_n(x) = (x^n - 1)\, A_{n-1}(x), \tag{2.7}$$

because (2.4) is then equivalent to

$$A(z) = e^{B(z)}, \tag{2.8}$$

by Euler's well-known formula for power series exponentiation (see, for example, exercise 4.7-4
in [3]).

Now (2.6) and (2.7) tell us that

$$B(z) = C(xz) - C(z), \tag{2.9}$$

where

$$C(z) = \sum_{n=1}^{\infty} C_n(x)\, \frac{z^n}{n!}, \tag{2.10}$$

$$C_n(x) = A_{n-1}(x); \tag{2.11}$$

and we have

$$C'(z) = \sum_{n=1}^{\infty} C_n(x)\, \frac{z^{n-1}}{(n-1)!} = \sum_{n=0}^{\infty} A_n(x)\, \frac{z^n}{n!} = A(z). \tag{2.12}$$

In other words $C'(z) = e^{C(xz) - C(z)}$; and if we set

$$G(z) = e^{C(z)} \tag{2.13}$$

we find

$$G'(z) = C'(z)\, G(z) = e^{C(xz)} = G(xz). \tag{2.14}$$
But this functional relation is easy to solve, for if we set

$$G(z) = \sum_{n=0}^{\infty} G_n(x)\, \frac{z^n}{n!} \tag{2.15}$$

the relation $G(xz) = G'(z)$ says simply that $x^n G_n(x) = G_{n+1}(x)$. Therefore

$$G(z) = \sum_{n=0}^{\infty} x^{n(n-1)/2}\, \frac{z^n}{n!}, \tag{2.16}$$

and we have deduced that

$$\sum_{n=1}^{\infty} (x-1)^{n-1} F_{n-1}(x)\, \frac{z^n}{n!} = C(z) = \ln \sum_{n=0}^{\infty} x^{n(n-1)/2}\, \frac{z^n}{n!}. \tag{2.17}$$

3. Connected graphs. We are interested in the behavior of $F_n(x)$ near $x = 1$, so it is convenient
to write $x = 1 + w$. Then (2.17) becomes

$$\sum_{n=1}^{\infty} w^{n-1} F_{n-1}(1+w)\, \frac{z^n}{n!} = \ln \sum_{n=0}^{\infty} (1+w)^{n(n-1)/2}\, \frac{z^n}{n!}. \tag{3.1}$$

Aha: the right side of this equation is well known as the exponential generating function for labeled
connected graphs [8]. Thus we have

$$w^{n-1} F_{n-1}(1+w) = C_n(1+w) = \sum_{G} w^{\mathrm{edges}(G)}, \tag{3.2}$$

where the sum is over all connected graphs $G$ on $n$ labeled vertices.
From this interpretation of $C_n(1+w)$, we see that

$$F_n(1+w) = C_{n,n+1} + w\, C_{n+1,n+1} + w^2\, C_{n+2,n+1} + \cdots, \tag{3.3}$$

where $C_{m,n}$ is the number of connected labeled graphs on $n$ vertices and $m$ edges. In particular,
$C_{n,n+1}$ is $(n+1)^{n-1}$, the number of labeled trees on $n+1$ vertices; this checks with the value of
$F_n(1)$ that we already knew.
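
Equation (3.3) can be confirmed for small $n$ by enumerating all graphs. The sketch below counts connected labeled graphs by number of edges and compares them with the coefficients of $F_n(1+w)$, taking $F_2$ and $F_3$ from (2.2); the function names are illustrative.

```python
from itertools import combinations
from math import comb


def connected_counts(n):
    """counts[e] = number of connected labeled graphs on n vertices with e edges."""
    pairs = list(combinations(range(n), 2))
    counts = [0] * (len(pairs) + 1)
    for mask in range(1 << len(pairs)):         # every subset of possible edges
        adj = [[] for _ in range(n)]
        edges = 0
        for i, (u, v) in enumerate(pairs):
            if mask >> i & 1:
                adj[u].append(v)
                adj[v].append(u)
                edges += 1
        seen, stack = {0}, [0]                  # depth-first search from vertex 0
        while stack:
            for v in adj[stack.pop()]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        if len(seen) == n:
            counts[edges] += 1
    return counts


def shift_by_one(coeffs):
    """Coefficients of p(1 + w) in powers of w, given those of p(x)."""
    out = [0] * len(coeffs)
    for d, c in enumerate(coeffs):
        for j in range(d + 1):
            out[j] += c * comb(d, j)
    return out


F3 = [6, 6, 3, 1]                               # F_3(x) from (2.2)
# C_{3,4}, C_{4,4}, C_{5,4}, C_{6,4} should match the coefficients of F_3(1 + w)
assert shift_by_one(F3) == connected_counts(4)[3:]
```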

4. Sparse connected graphs. Let

$$W_k(z) = \sum_{n=1}^{\infty} C_{n-1+k,n}\, \frac{z^n}{n!} \tag{4.1}$$

be the generating function for $k$-cyclic components of a labeled graph; thus $W_0(z)$ generates
unrooted trees, $W_1(z)$ generates connected components that have exactly one cycle, $W_2(z)$ generates
bicyclic components, and in general $W_k(z)$ generates connected graphs that have $k-1$ more edges
than vertices. From (3.3) and (1.4) we have

$$F(1+w, z) = W_0'(z) + w\, W_1'(z) + w^2\, W_2'(z) + \cdots. \tag{4.2}$$
E. M. Wright [11] showed how to compute the $W$'s systematically, and proved that they are all
expressible in terms of the tree function

$$T(z) = \sum_{n=1}^{\infty} n^{n-1}\, \frac{z^n}{n!}, \tag{4.3}$$

which generates rooted trees. (See [2] for simplifications and extensions of Wright's results. In that
paper, $W_0(z)$, $W_1(z)$, and $W_2(z)$ are called respectively $U(z)$, $V(z)$, and $W(z)$.)

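The known results about $W_k(z)$ for small $k$ show that we have

$$F(1+w, z) = \frac{T(z)}{z}\, f(w, T(z)), \tag{4.4}$$

where $f(w,t)$ has the following leading terms:

$$\begin{aligned}
f(w,t) = 1 &+ w\, \frac{t^2}{2(1-t)^2}
 + w^2 \left( \frac{5}{24} \frac{t^4}{(1-t)^5} (5-2t) + \frac{1}{4} \frac{t^3}{(1-t)^4} (4-2t) \right) \\
&+ w^3 \left( \frac{5}{16} \frac{t^7}{(1-t)^8} (8-2t) + \frac{55}{48} \frac{t^6}{(1-t)^7} (7-2t) + \frac{73}{48} \frac{t^5}{(1-t)^6} (6-2t) \right. \\
&\qquad\quad \left. {} + \frac{3}{4} \frac{t^4}{(1-t)^5} (5-2t) + \frac{1}{24} \frac{t^3}{(1-t)^4} (4-2t) \right) \\
&+ w^4 \left( \frac{1105}{1152} \frac{t^{10}}{(1-t)^{11}} (11-2t) + \cdots \right) + \cdots.
\end{aligned}\tag{4.5}$$

(See formula (8.13) in [2], and use the fact that $zT'(z) = T(z)/(1 - T(z))$.)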
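
The low-order part of (4.4) and (4.5) can be checked against recurrence (2.1) with truncated power series. This sketch assumes the coefficients of $w$ and $w^2$ in $f(w,t)$ are $t^2/(2(1-t)^2)$ and $\frac{5}{24}\frac{t^4(5-2t)}{(1-t)^5} + \frac14\frac{t^3(4-2t)}{(1-t)^4}$; the helper names and the truncation order `N` are illustrative.

```python
from fractions import Fraction
from math import comb, factorial

N = 7  # all power series in z are truncated after z^N


def mul(a, b):
    """Product of two truncated series."""
    return [sum(a[i] * b[k - i] for i in range(k + 1)) for k in range(N + 1)]


def power(a, e):
    out = [Fraction(1)] + [Fraction(0)] * N
    for _ in range(e):
        out = mul(out, a)
    return out


def scale(c, a):
    return [c * ak for ak in a]


def add(a, b):
    return [ak + bk for ak, bk in zip(a, b)]


def parking_polys(n_max):
    """Coefficient lists of F_0 .. F_{n_max} from recurrence (2.1)."""
    F = [[1]]
    for n in range(1, n_max + 1):
        coeffs = [0] * (n * (n - 1) // 2 + 1)
        for k in range(1, n + 1):
            for i, a in enumerate(F[k - 1]):
                for j, b in enumerate(F[n - k]):
                    for s in range(k):
                        coeffs[i + j + s] += comb(n - 1, k - 1) * a * b
        F.append(coeffs)
    return F


# Tree function T(z) from (4.3), and T(z)/z, both truncated after z^N.
T = [Fraction(0)] + [Fraction(n ** (n - 1), factorial(n)) for n in range(1, N + 1)]
T_over_z = [Fraction((n + 1) ** n, factorial(n + 1)) for n in range(N + 1)]

# R = 1/(1 - T), obtained from R = 1 + T*R coefficient by coefficient.
R = [Fraction(1)] + [Fraction(0)] * N
for k in range(1, N + 1):
    R[k] = sum(T[i] * R[k - i] for i in range(1, k + 1))

const = lambda c: [Fraction(c)] + [Fraction(0)] * N
f1 = scale(Fraction(1, 2), mul(power(T, 2), power(R, 2)))
f2 = add(scale(Fraction(5, 24), mul(mul(power(T, 4), add(const(5), scale(-2, T))), power(R, 5))),
         scale(Fraction(1, 4), mul(mul(power(T, 3), add(const(4), scale(-2, T))), power(R, 4))))

series = {0: T_over_z, 1: mul(T_over_z, f1), 2: mul(T_over_z, f2)}
F = parking_polys(N)
# n! [w^k z^n] of the right side of (4.4) versus [w^k] F_n(1 + w)
ok = all(series[k][n] * factorial(n) == sum(c * comb(d, k) for d, c in enumerate(F[n]))
         for k in series for n in range(N + 1))
print(ok)
```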

5. Application to linear probing. We can now put everything together and calculate factorial
moments of the distribution of total displacement when $n$ items are inserted into $m$ cells by linear
probing. The tree function has a wonderful property that leads to considerable simplification,
thanks to Lagrange's inversion formula and the identity $T(z) = z e^{T(z)}$:

$$\begin{aligned}
[z^n]\, F(1+w, z)^{m-n} &= [z^n]\, \frac{T(z)^{m-n}}{z^{m-n}}\, f(w, T(z))^{m-n} \\
&= [z^m]\, T(z)^{m-n} f(w, T(z))^{m-n} \\
&= [t^n]\, e^{mt} (1-t)\, f(w,t)^{m-n}.
\end{aligned}\tag{5.1}$$

(See [3], third edition, exercise 4.7-16, for a simple algorithmic proof of Lagrange's formula.)
We will need to use the functions

$$Q_r(m,n) = \binom{r}{0} + \binom{r+1}{1} \frac{n}{m} + \binom{r+2}{2} \frac{n(n-1)}{m^2} + \cdots = {}_2F_0(r+1, -n;\, ;\, -1/m), \tag{5.2}$$

which are known to appear in the analysis of linear probing (see [4], Theorem K); they have the
simple generating function

$$\sum_{n \ge 0} Q_r(m,n)\, \frac{m^n t^n}{n!} = \frac{e^{mt}}{(1-t)^{r+1}}. \tag{5.3}$$

The formulas above now allow us to compute the expected total displacement as follows, using
(5.3) and (4.5):

$$\begin{aligned}
\frac{[w z^n]\, F(1+w,z)^{m-n}}{[z^n]\, F(1,z)^{m-n}}
&= \frac{[t^n]\, e^{mt} (1-t)(m-n)\, t^2 / (2(1-t)^2)}{[t^n]\, e^{mt} (1-t)} \\
&= \frac{\frac12 (m-n)\, [t^n]\, e^{mt}\, t\, \bigl(1/(1-t) - 1\bigr)}{m^n/n! - m^{n-1}/(n-1)!} \\
&= \frac{\frac12 (m-n)\, m^{n-1} \bigl(Q_0(m,n-1) - 1\bigr) / (n-1)!}{(m-n)\, m^{n-1}/n!} \\
&= \frac{n}{2} \bigl(Q_0(m,n-1) - 1\bigr).
\end{aligned}\tag{5.4}$$

This agrees with the known result that a successful search requires $\frac12 \bigl(Q_0(m,n-1) + 1\bigr)$ probes, on
the average [4, Theorem K].
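
Formula (5.4) is easy to compare with an exhaustive average over all $m^n$ hash sequences. The sketch below evaluates $Q_r(m,n)$ from its definition (5.2) with exact fractions; the function names are hypothetical, and `Q` is taken to be 0 for negative $n$ purely as a convention of this sketch.

```python
from fractions import Fraction
from itertools import product


def Q(r, m, n):
    """Q_r(m, n) = sum_k C(r+k, k) * n(n-1)...(n-k+1) / m^k, as in (5.2)."""
    total, term = Fraction(0), Fraction(1)
    for k in range(n + 1):                      # empty loop (value 0) when n < 0
        total += term
        # ratio of consecutive terms: (r+k+1)/(k+1) * (n-k)/m
        term = term * (r + k + 1) * (n - k) / ((k + 1) * m)
    return total


def mean_displacement(m, n):
    """Average total displacement d over all m^n hash sequences (brute force)."""
    total = 0
    for hashes in product(range(m), repeat=n):
        occupied = [False] * m
        for h in hashes:
            q = h
            while occupied[q]:
                q = (q + 1) % m
            occupied[q] = True
            total += (q - h) % m
    return Fraction(total, m ** n)


for m, n in [(3, 2), (4, 3), (5, 3), (6, 4)]:
    assert mean_displacement(m, n) == Fraction(n, 2) * (Q(0, m, n - 1) - 1)
```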

Moreover, a similar calculation gives

$$\begin{aligned}
\frac{[w^2 z^n]\, F(1+w,z)^{m-n}}{[z^n]\, F(1,z)^{m-n}}
= \frac{n(n-1)(n-2)}{24\, m^2} \Bigl( & 15\, Q_3(m,n-3) + (4 + 3m - 3n)\, Q_2(m,n-3) \\
& {} + (5 - 3m + 3n)\, Q_1(m,n-3) \Bigr).
\end{aligned}\tag{5.5}$$

This is the expected value of $\binom{d}{2}$, from which of course we obtain the expected value of $d^2$ by
doubling and adding (5.4). All moments can in principle be obtained in this way, although the
expressions get more and more complicated.
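
The second factorial moment can be checked the same way. This sketch assumes the prefactor of (5.5) is $n(n-1)(n-2)/(24m^2)$ and compares the formula with the exact average of $\binom{d}{2}$ over all hash sequences; the function names are illustrative.

```python
from fractions import Fraction
from itertools import product
from math import comb


def Q(r, m, n):
    """Q_r(m, n) from (5.2); returns 0 when n < 0 (a convention of this sketch)."""
    total, term = Fraction(0), Fraction(1)
    for k in range(n + 1):
        total += term
        term = term * (r + k + 1) * (n - k) / ((k + 1) * m)
    return total


def mean_binomial_d2(m, n):
    """Exact average of C(d, 2) over all m^n hash sequences."""
    total = 0
    for hashes in product(range(m), repeat=n):
        occupied, d = [False] * m, 0
        for h in hashes:
            q = h
            while occupied[q]:
                q = (q + 1) % m
            occupied[q] = True
            d += (q - h) % m
        total += comb(d, 2)
    return Fraction(total, m ** n)


def formula_5_5(m, n):
    bracket = (15 * Q(3, m, n - 3)
               + (4 + 3 * m - 3 * n) * Q(2, m, n - 3)
               + (5 - 3 * m + 3 * n) * Q(1, m, n - 3))
    return Fraction(n * (n - 1) * (n - 2), 24 * m * m) * bracket


for m, n in [(3, 2), (4, 3), (5, 3), (5, 4), (6, 4)]:
    assert mean_binomial_d2(m, n) == formula_5_5(m, n)
```

Doubling this value and adding the mean from (5.4) gives the expected value of $d^2$.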

Formulas such as (5.5) can be rewritten in many ways using the identities

$$r\, Q_r(m,n) = m\, Q_{r-2}(m,n) - (m - n - r)\, Q_{r-1}(m,n); \tag{5.6}$$

$$r\, Q_r(m,n) = m\, Q_{r-1}(m,n+1) - m\, Q_{r-1}(m,n); \tag{5.7}$$

$$n\, Q_r(m,n-1) = m\, Q_r(m,n) - m\, Q_{r-1}(m,n). \tag{5.8}$$

However, none of these transformations seems to convert (5.5) into a substantially simpler formula.
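
The three identities can be verified numerically from definition (5.2). In this sketch, `Q` and `check_identities` are illustrative names, and $Q_{-1}(m,n) = 1$ follows from the definition because $\binom{k-1}{k} = 0$ for $k \ge 1$.

```python
from fractions import Fraction


def Q(r, m, n):
    """Q_r(m, n) from (5.2); returns 0 when n < 0 (a convention of this sketch)."""
    total, term = Fraction(0), Fraction(1)
    for k in range(n + 1):
        total += term
        term = term * (r + k + 1) * (n - k) / ((k + 1) * m)
    return total


def check_identities(m, n, r):
    """Return the truth values of (5.6), (5.7), (5.8) for the given m, n, r."""
    lhs = r * Q(r, m, n)
    id56 = lhs == m * Q(r - 2, m, n) - (m - n - r) * Q(r - 1, m, n)
    id57 = lhs == m * Q(r - 1, m, n + 1) - m * Q(r - 1, m, n)
    id58 = n * Q(r, m, n - 1) == m * Q(r, m, n) - m * Q(r - 1, m, n)
    return id56, id57, id58


for m in range(2, 9):
    for n in range(0, m):
        for r in range(1, 5):
            assert check_identities(m, n, r) == (True, True, True)
```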
6. Related work. Germain Kreweras [6] discussed the polynomials $F_n(x)$ at length, showing that
they are the generating functions for "suites majeures," which are equivalent to parking sequences
with displacements enumerated. He also showed that $F_n(-1)$ is the number of "up-down"
permutations, and that $F_n(x)$ is the generating function for inversions in a labeled tree of $n+1$ nodes.
The concept of inversions in trees was first defined by Colin Mallows and John Riordan [7], who
established their relation to connected graphs. Thus, all of the main ideas of sections 2, 3, 4 were
already in the literature, waiting to be applied to the analysis of linear probing.

A one-to-one correspondence that maps labeled trees on $\{0, 1, \ldots, n\}$ with $k$ inversions
bijectively into parking sequences on $\{1, \ldots, n\}$ with total displacement $k$ appears in [4, second edition,
answer to exercise 6.4-31]. A beautiful construction that uses depth-first search to establish (3.2),
by relating each $n$-node tree with $k$ inversions to $2^k$ connected graphs whose edges are enumerated
by $w^{n-1}(1+w)^k$, was found by Ira Gessel and Da-Lun Wang [1]. Therefore the relation between linear
probing and graphs can be made quite explicit, although there is apparently no really simple connection.

The expected value of $d^2$ was first obtained by Alfredo Viola and Patricio Poblete [10], who
discovered a formula equivalent to (5.5) about one week before the author had independently carried
out the calculations above. Their starting point was equivalent to the symmetry-breaking strategy
of section 1; their other methods provide an interesting alternative to those of the present note.

7. Personal remarks. The problem of linear probing is near and dear to my heart, because 
I found it immensely satisfying to deduce (5.4) when I first studied the problem in 1962. Linear 
probing was the first algorithm that I was able to analyze successfully, and the experience had a 
significant effect on my future career as a computer scientist. None of the methods available in 1962 
were powerful enough to deduce the expected square displacement, much less the higher moments, 
so it is an even greater pleasure to be able to derive such results today from other work that has 
enriched the field of combinatorial mathematics during a period of 35 years. 

It is also gratifying to know that the field of algorithmic analysis has matured to the point 
where researchers in different parts of the world are now able to resolve such difficult problems 
working independently. 

The reader will note that Sedgewick and Flajolet's exercise 8.39 has not truly been solved,
strictly speaking, because we have not found the EGF $\sum_{n=0}^{m} F_{mn}(x)\, z^n/n!$ as requested. However,
Sedgewick and Flajolet should be happy with any analysis of linear probing that uses symbolic
methods associated with generating functions in an informative way.

I thank the referees for their perceptive remarks and valuable suggestions. 

Finally, I wish to pay tribute to my secretary of more than twenty-five years, Phyllis Astrid 
Benson Winkler, who is retiring this year. The present paper is the last of more than one hundred 
that she has typed and typeset beautifully for me at Stanford. 